
    Incremental Medians via Online Bidding

    In the k-median problem we are given sets of facilities and customers, and distances between them. For a given set F of facilities, the cost of serving a customer u is the minimum distance between u and a facility in F. The goal is to find a set F of k facilities that minimizes the sum, over all customers, of their service costs. Following Mettu and Plaxton, we study the incremental medians problem, where k is not known in advance and the algorithm produces a nested sequence of facility sets in which the kth set has size k. The algorithm is c-cost-competitive if, for every k, the cost of the kth set is at most c times the cost of an optimal set of size k. We give improved incremental algorithms for the metric version: an 8-cost-competitive deterministic algorithm, a 2e ~ 5.44-cost-competitive randomized algorithm, a (24+epsilon)-cost-competitive, poly-time deterministic algorithm, and a (6e+epsilon ~ 16.31)-cost-competitive, poly-time randomized algorithm. The algorithm is s-size-competitive if the cost of the kth set is at most the minimum cost of any set of size k, and its size is at most s·k. The optimal size-competitive ratios for this problem are 4 (deterministic) and e (randomized). We present the first poly-time O(log m)-size-approximation algorithm for the offline problem and the first poly-time O(log m)-size-competitive algorithm for the incremental problem. Our proofs reduce incremental medians to the following online bidding problem: faced with an unknown threshold T, an algorithm submits "bids" until it submits a bid that is at least the threshold. It pays the sum of all its bids. We prove that folklore algorithms for online bidding are optimally competitive. Comment: conference version appeared in LATIN 2006 as "Oblivious Medians via Online Bidding"
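    The online bidding problem the abstract reduces to is simple enough to state in code. Below is a minimal sketch of the folklore doubling strategy; the `is_enough` oracle (standing in for the unknown threshold T) and all names are illustrative, not from the paper.

```python
def doubling_bids(is_enough):
    """Folklore doubling strategy for online bidding.

    `is_enough(b)` reports whether bid b meets the unknown threshold T.
    Bids 1, 2, 4, ... are submitted until one succeeds; the total paid
    is the sum of all bids.  If 2^(i-1) failed, then T > 2^(i-1), so the
    total 2^(i+1) - 1 is less than 4*T: the deterministic ratio of 4
    stated in the abstract.
    """
    total, bid = 0, 1
    while True:
        total += bid
        if is_enough(bid):
            return total
        bid *= 2

# Example: threshold T = 37; the strategy pays 1+2+4+8+16+32+64 = 127 < 4*37.
T = 37
print(doubling_bids(lambda b: b >= T))
```

    Randomizing the starting scale of the doubling sequence is the folklore route to the randomized ratio of e mentioned above.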

    A really simple approximation of smallest grammar

    In this paper we present a really simple linear-time algorithm that constructs a context-free grammar of size O(g log(N/g)) for the input string, where N is the size of the input string and g the size of the smallest grammar generating this string. The algorithm works for arbitrary-size alphabets, but the running time is linear assuming that the alphabet Sigma of the input string can be identified with numbers from 1, ..., N^c for some constant c. Algorithms with such an approximation guarantee and running time were known before; however, all of them are non-trivial and their analyses involved. The algorithm presented here computes the LZ77 factorisation and transforms it into a grammar in phases. In each phase it maintains an LZ77-like factorisation of the word with at most l factors, plus O(l) additional letters, where l is the size of the original LZ77 factorisation. In each phase, in a greedy way (by a left-to-right sweep, with the help of the factorisation), we choose a set of pairs of consecutive letters to be replaced with new symbols, i.e. nonterminals of the constructed grammar. At least 2/3 of the letters in the word are chosen, and there are only O(l) distinct pairs among them. Hence there are O(log N) phases, each of which introduces O(l) nonterminals to the grammar. A more precise analysis yields a bound of O(l log(N/l)). As l <= g, this yields the desired bound O(g log(N/g)). Comment: Accepted for CPM 2014
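    As a toy illustration of the phase structure (not the paper's algorithm: the LZ77-guided choice of pairs, which keeps the number of distinct pairs per phase at O(l), is omitted), here is a hedged Python sketch that repeatedly replaces consecutive pairs with nonterminals, at least halving the word in each of O(log N) phases:

```python
def pair_phases(word):
    """Toy sketch of the pairing phases: greedily replace non-overlapping
    pairs of consecutive symbols with nonterminals, reusing a nonterminal
    when the same pair reappears within a phase.  Each phase at least
    halves the word, so there are O(log N) phases; the real algorithm
    additionally uses the LZ77 factorisation to bound the distinct pairs.
    """
    rules = {}                 # nonterminal -> the pair it derives
    symbols = list(word)
    while len(symbols) > 1:
        inv, nxt, i = {}, [], 0
        while i < len(symbols):
            if i + 1 < len(symbols):
                pair = (symbols[i], symbols[i + 1])
                if pair not in inv:
                    nt = "N%d" % len(rules)
                    rules[nt] = pair
                    inv[pair] = nt
                nxt.append(inv[pair])
                i += 2
            else:               # odd leftover symbol passes through
                nxt.append(symbols[i])
                i += 1
        symbols = nxt
    return symbols[0], rules   # start symbol and grammar rules

start, rules = pair_phases("abracadabra")
print(start, rules)
```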

    Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially

    Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-center variant which, given a set S of points from some metric space and a parameter k < |S|, requires identifying a subset of k centers in S minimizing the maximum distance of any point of S from its closest center. We develop MapReduce and Streaming algorithms for this problem and for its generalization with outliers, where a bounded number of points may be disregarded when computing the maximum distance. For any fixed epsilon > 0, the algorithms yield solutions whose approximation ratios are a mere additive term epsilon away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) D. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of over a billion points, which show that our algorithms yield better-quality solutions than the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones
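    For context, the classical polynomial-time sequential baseline for k-center (without outliers) is the farthest-point greedy algorithm, which achieves approximation ratio 2 in any metric space. Here is a minimal sketch; the function and variable names are ours, not the paper's:

```python
import math

def gonzalez_k_center(points, k, dist=math.dist):
    """Farthest-point greedy (Gonzalez): a 2-approximation for k-center.

    Starts from an arbitrary point and repeatedly adds the point
    farthest from the current centers, maintaining each point's
    distance to its closest center.
    """
    centers = [points[0]]
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers, max(d)   # centers and the clustering radius

pts = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
centers, radius = gonzalez_k_center(pts, 2)
print(centers, radius)
```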

    New Bell inequalities for the singlet state: Going beyond the Grothendieck bound

    Contemporary versions of Bell's argument against local hidden variable (LHV) theories are based on the Clauser-Horne-Shimony-Holt (CHSH) inequality and various attempts to generalize it. The amount of violation of these inequalities cannot exceed the bound set by the Grothendieck constants. However, if we go back to the original derivation by Bell and use the perfect anti-correlation embodied in the singlet spin state, we can go beyond these bounds. In this paper we derive two-particle Bell inequalities for traceless two-outcome observables whose violation in the singlet spin state goes beyond the Grothendieck constants in both the two- and three-dimensional cases. Moreover, creating a higher-dimensional analog of perfect correlations and applying a recent result of Alon and his associates (Invent. Math. 163, 499 (2006)), we prove that there are two-particle Bell inequalities for traceless two-outcome observables whose violation increases to infinity as the dimension and number of measurements grow. Technically, these results are possible because perfect correlations (or anti-correlations) allow us to transport the indices of the inequality from the edges of a bipartite graph to those of the complete graph. Finally, it is shown how to apply these results to mixed Werner states, provided that the noise does not exceed 20%. Comment: 18 pages, two figures, some corrections and additional references, published version
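    For reference, the CHSH inequality mentioned above bounds, for any LHV theory, the correlations E of two-outcome (+-1) observables a, a' and b, b' measured on the two particles; this is the standard textbook form, not this paper's generalization:

```latex
% CHSH: any local hidden variable model satisfies the bound 2,
% while the singlet state reaches Tsirelson's bound 2*sqrt(2).
\[
  \bigl| E(a,b) + E(a,b') + E(a',b) - E(a',b') \bigr| \;\le\; 2
  \quad\text{(LHV)}, \qquad
  \max_{\text{quantum}} = 2\sqrt{2}.
\]
```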

    One Table to Count Them All: Parallel Frequency Estimation on Single-Board Computers

    Sketches are probabilistic data structures that can provide approximate results within mathematically proven error bounds while using orders of magnitude less memory than traditional approaches. They are tailored for streaming data analysis even on architectures with limited memory, such as the single-board computers widely used for IoT and edge computing. Since these devices offer multiple cores, efficient parallel sketching schemes let them manage high volumes of data streams. However, since their caches are relatively small, careful parallelization is required. In this work, we focus on the frequency estimation problem and evaluate the performance of a high-end server, a 4-core Raspberry Pi, and an 8-core Odroid. As the sketch, we employ the widely used Count-Min Sketch. To hash the stream in parallel and in a cache-friendly way, we apply a novel tabulation approach and rearrange the auxiliary tables into a single one. To parallelize the process efficiently, we modify the workflow and apply a form of buffering between hash computations and sketch updates. Today, many single-board computers have heterogeneous processors in which slow and fast cores are equipped together. To utilize all these cores to their full potential, we propose a dynamic load-balancing mechanism that significantly increases the performance of frequency estimation. Comment: 12 pages, 4 figures, 3 algorithms, 1 table, submitted to EuroPar'19
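    A minimal, single-threaded sketch of a Count-Min Sketch with simple tabulation hashing may help fix ideas. The paper's actual contributions (rearranging the tabulation tables into a single table, buffering, and heterogeneous load balancing) are omitted, and all parameters and names here are illustrative:

```python
import random

class CountMinSketch:
    """Count-Min Sketch with simple tabulation hashing.

    With width w = ceil(e/eps) and depth d = ceil(ln(1/delta)), the
    estimate overcounts the true frequency by at most eps*N (N = total
    stream length) with probability at least 1 - delta.
    """
    def __init__(self, d=4, w=1 << 16, key_bytes=4, seed=0):
        rng = random.Random(seed)
        self.d, self.w, self.key_bytes = d, w, key_bytes
        # one random 256-entry table per row and per key byte
        self.tables = [[[rng.getrandbits(32) for _ in range(256)]
                        for _ in range(key_bytes)] for _ in range(d)]
        self.counts = [[0] * w for _ in range(d)]

    def _hash(self, row, key):
        # simple tabulation: XOR one table lookup per key byte
        h = 0
        for b in range(self.key_bytes):
            h ^= self.tables[row][b][(key >> (8 * b)) & 0xFF]
        return h % self.w

    def update(self, key, c=1):
        for r in range(self.d):
            self.counts[r][self._hash(r, key)] += c

    def estimate(self, key):
        return min(self.counts[r][self._hash(r, key)] for r in range(self.d))

cms = CountMinSketch()
for x in [7, 7, 7, 42]:
    cms.update(x)
print(cms.estimate(7), cms.estimate(42))  # each >= the true frequency
```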

    Discovering Dynamic Communities in Interaction Networks


    Mining Top-K Frequent Itemsets Through Progressive Sampling

    We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of the size defined by the upper bound or reached through the progressive sampling approach) makes it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proven. Comment: 16 pages, 2 figures, accepted for presentation at ECML PKDD 2010 and publication in the ECML PKDD 2010 special issue of the Data Mining and Knowledge Discovery journal
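    The progressive sampling loop itself has a simple skeleton. The sketch below is hedged: `mine_topk` and `stop_condition` are placeholders for the paper's mining subroutine and stopping conditions, neither of which is specified here.

```python
import random

def progressive_sample_topk(dataset, mine_topk, stop_condition,
                            initial_size=1000, growth=2.0):
    """Skeleton of a progressive sampling loop: grow the sample
    geometrically, mine candidate top-K itemsets, and stop as soon as
    the stopping condition certifies the required approximation (or the
    sample covers the whole dataset).  Geometric growth keeps the total
    work proportional to the final sample size.
    """
    size = initial_size
    while True:
        sample = random.sample(dataset, min(size, len(dataset)))
        candidates = mine_topk(sample)
        if stop_condition(candidates, sample) or size >= len(dataset):
            return candidates
        size = int(size * growth)
```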

    Maximum gradient embeddings and monotone clustering

    Let (X, d_X) be an n-point metric space. We show that there exists a distribution D over non-contractive embeddings into trees f : X --> T such that for every x in X, the expectation with respect to D of the maximum over y in X of the ratio d_T(f(x), f(y)) / d_X(x, y) is at most C (log n)^2, where C is a universal constant. Conversely, we show that the above quadratic dependence on log n cannot be improved in general. Such embeddings, which we call maximum gradient embeddings, yield a framework for the design of approximation algorithms for a wide range of clustering problems with monotone costs, including fault-tolerant versions of k-median and facility location. Comment: 25 pages, 2 figures. Final version, minor revision of the previous one. To appear in "Combinatorica"

    Tree Compression with Top Trees Revisited

    We revisit tree compression with top trees (Bille et al., ICALP'13) and present several improvements to the compressor and its analysis. By significantly reducing the amount of information stored and guiding the compression step using a RePair-inspired heuristic, we obtain a fast compressor achieving good compression ratios, addressing an open problem posed by Bille et al. We show how, with relatively small overhead, the compressed file can be converted into an in-memory representation that supports basic navigation operations in worst-case logarithmic time without decompression. We also show a much improved worst-case bound on the size of the output of top-tree compression (answering an open question posed in a talk on this algorithm by Weimann in 2012). Comment: SEA 2015

    New Approximability Results for the Robust k-Median Problem

    We consider a robust variant of the classical k-median problem, introduced by Anthony et al. [AnthonyGGN10]. In the Robust k-Median problem, we are given an n-vertex metric space (V, d) and m client sets S_1, ..., S_m, each a subset of V. The objective is to open a set F of k facilities, F a subset of V, such that the worst-case connection cost over all client sets is minimized; in other words, minimize max_i sum_{v in S_i} d(F, v). Anthony et al. showed an O(log m)-approximation algorithm for any metric and APX-hardness even in the case of the uniform metric. In this paper, we show that their algorithm is nearly tight by proving Omega(log m / log log m) approximation hardness, unless NP is contained in the intersection over all delta > 0 of DTIME(2^(n^delta)). This hardness result holds even for uniform and line metrics. To our knowledge, this is one of the rare cases in which a problem on a line metric is hard to approximate to within a logarithmic factor. We complement the hardness result with an experimental evaluation of different heuristics, which shows that very simple heuristics achieve good approximations on realistic classes of instances. Comment: 19 pages
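    A small sketch of the objective may help parse the definition; the names and the example metric are ours, not the paper's:

```python
def robust_k_median_cost(F, client_sets, d):
    """Robust k-Median objective: the worst-case connection cost over
    the m client sets, i.e. max_i sum_{v in S_i} min_{f in F} d(f, v).
    """
    return max(sum(min(d(f, v) for f in F) for v in S) for S in client_sets)

# Example on a line metric (the setting of the hardness result above):
d = lambda a, b: abs(a - b)
F = {2, 8}
client_sets = [{1, 3, 9}, {5, 6}]
print(robust_k_median_cost(F, client_sets, d))  # max(1+1+1, 3+2) = 5
```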